An Efficient Stream-based Join to Process End User Transactions in Real-Time Data Warehousing
Authors
Abstract
In real-time data warehousing, semi-stream processing has been an active area of research over the last decade. One important operation in semi-stream processing is joining stream data with slowly changing, disk-based master data. A join operator is usually required to implement this operation. This operator typically works with limited main memory, which is generally not large enough to hold all of the disk-based master data. Recently, a seminal join algorithm called MESHJOIN (Mesh Join) has been proposed in the literature to process semi-stream data. MESHJOIN is a candidate for a resource-aware system setup. However, MESHJOIN is not very selective: it does not consider the characteristics of the stream data, and its performance is suboptimal for skewed stream data. In this paper we propose a novel Semi-Stream Join (SSJ) that uses a new cache module. The algorithm is more appropriate for skewed distributions, and we present results for Zipfian distributions of the type that appears in many applications. We present a cost model for SSJ and validate it with experiments. Based on the cost model, we also tune the algorithm for maximum performance. We conduct a rigorous experimental study to test our algorithm. Our experiments show that SSJ outperforms MESHJOIN significantly.
Subject Categories and Descriptors: H.2.7 [Database Administration]: Data Warehouse and Repository; H.2.4 [Systems]: Transaction Processing. General Terms: Data Warehousing, Data Processing
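The abstract describes joining a stream against disk-based master data with the help of a cache, which pays off when a few keys dominate the stream. The sketch below is only an illustration of that general idea and is not the SSJ or MESHJOIN algorithm: it swaps in a plain LRU cache in front of a key lookup. The names `cached_semi_stream_join`, `master_lookup`, and `cache_size`, as well as the sample data, are hypothetical and do not come from the paper.

```python
from collections import OrderedDict


def cached_semi_stream_join(stream, master_lookup, cache_size):
    """Join (key, payload) stream tuples with master records.

    Illustrative only: an LRU cache stands in for the paper's cache
    module, and master_lookup stands in for a disk read.
    """
    cache = OrderedDict()  # key -> master record, least recently used first
    hits = misses = 0
    output = []
    for key, payload in stream:
        if key in cache:
            cache.move_to_end(key)  # mark as most recently used
            record = cache[key]
            hits += 1
        else:
            record = master_lookup(key)  # assumed to be the expensive path
            misses += 1
            if record is not None:
                cache[key] = record
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
        if record is not None:
            output.append((key, payload, record))
    return output, hits, misses


# Hypothetical master data and a small stream in which key 1 is frequent.
master = {k: f"product-{k}" for k in range(1, 101)}
stream = [(1, "a"), (2, "b"), (1, "c"), (1, "d"), (3, "e"), (2, "f"), (999, "g")]

joined, hits, misses = cached_semi_stream_join(stream, master.get, cache_size=2)
print(len(joined), hits, misses)
```

With a skewed stream, repeated keys are served from memory and only the less frequent keys reach the master data. A real semi-stream join would also have to bound memory for the stream buffer and amortize disk reads, which this sketch leaves out.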
Similar resources
HYBRIDJOIN for Near-Real-Time Data Warehousing
An important component of near-real-time data warehouses is the near-real-time integration layer. One important element in near-real-time data integration is the join of a continuous input data stream with a disk-based relation. For high-throughput streams, stream-based algorithms, such as Mesh Join (MESHJOIN), can be used. However, in MESHJOIN the performance of the algorithm is inversely prop...
X-HYBRIDJOIN for Near-Real-Time Data Warehousing
In order to make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up-to-date with respect to end user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation where the stream of updates is joined with disk-based master data. The stream-based...
Improving Updates of Near-Real-Time Data Warehouses
A near-real-time data warehouse gives end users the information they need to make appropriate decisions. The fresher the data, the better the resulting decisions. To keep the data fresh and up to date, changes made at the source must be added to the data warehouse with little delay. For this reason, they should be transformed in to the data wareh...
Tuned X-HYBRIDJOIN for Near-Real-Time Data Warehousing
Near-real-time data warehousing defines how updates from data sources are combined and transformed for storage in a data warehouse as soon as the updates occur. Since these updates are not in warehouse format, they need to be transformed and a join operator is usually required to implement this transformation. A stream-based algorithm called X-HYBRIDJOIN (Extended Hybrid Join), with a favorable...
Optimizing Queue-Based Semi-Stream Joins with Indexed Master Data
In Data Stream Management Systems (DSMS) semi-stream processing has become a popular area of research due to the high demand of applications for up-to-date information (e.g. in real-time data warehousing). A common operation in stream processing is joining an incoming stream with disk-based master data, also known as semi-stream join. This join typically works under the constraint of limited ma...
Journal:
- JDIM
Volume 12, Issue
Pages -
Publication date: 2014